{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Car prices data preparation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before running this code, make sure you've downloaded the data CSV file from https://www.kaggle.com/austinreese/craigslist-carstrucks-data.\n",
    "\n",
    "You may have to create a Kaggle account to download the data.\n",
    "\n",
    "After downloading it, extract the ZIP file and make sure you have a file named `vehicles.csv` in the current directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "used_cars = pd.read_csv(\"vehicles.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dataset includes the following columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['id', 'url', 'region', 'region_url', 'price', 'year', 'manufacturer',\n",
       "       'model', 'condition', 'cylinders', 'fuel', 'odometer', 'title_status',\n",
       "       'transmission', 'vin', 'drive', 'size', 'type', 'paint_color',\n",
       "       'image_url', 'description', 'county', 'state', 'lat', 'long'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "used_cars.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You'll use just a subset of these columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "u = used_cars[\n",
    "    [\n",
    "        \"price\",\n",
    "        \"year\",\n",
    "        \"condition\",\n",
    "        \"cylinders\",\n",
    "        \"fuel\",\n",
    "        \"odometer\",\n",
    "        \"transmission\",\n",
    "        \"size\",\n",
    "        \"type\",\n",
    "    ]\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some lines have incomplete data. In this example, you'll just drop them:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "u = u.dropna()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(70256, 9)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "u.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are 70256 rows with complete data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It's also important to take a look at the categorical variables:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['good', 'fair', 'excellent', 'like new', 'salvage', 'new'],\n",
       "      dtype=object)"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "u[\"condition\"].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['8 cylinders', '6 cylinders', '4 cylinders', '5 cylinders',\n",
       "       '10 cylinders', '3 cylinders', 'other', '12 cylinders'],\n",
       "      dtype=object)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "u[\"cylinders\"].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['diesel', 'gas', 'electric', 'hybrid', 'other'], dtype=object)"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "u[\"fuel\"].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['automatic', 'manual', 'other'], dtype=object)"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "u[\"transmission\"].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['full-size', 'mid-size', 'compact', 'sub-compact'], dtype=object)"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "u[\"size\"].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['truck', 'SUV', 'sedan', 'mini-van', 'hatchback', 'coupe',\n",
       "       'pickup', 'wagon', 'convertible', 'van', 'other', 'offroad', 'bus'],\n",
       "      dtype=object)"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "u[\"type\"].unique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It's difficult to build a model that generalizes to a wide range of types of car. So, you'll consider only the samples with more usual features:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "u = u[u.cylinders.isin([\"4 cylinders\", \"6 cylinders\"])]\n",
    "u = u[u.fuel.isin([\"gas\", \"diesel\"])]\n",
    "u = u[u.transmission.isin([\"automatic\", \"manual\"])]\n",
    "u = u[u.type.isin([\"sedan\", \"coupe\", \"wagon\", \"hatchback\"])]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(23382, 9)"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "u.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another problem is that the price is set to zero for some rows. Also, there are rows with very high prices, some of these due to input errors. Here, you'll just filter out these rows, considering just the ones in which 0 < price < 40000:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "u = u[u[\"price\"] > 0]\n",
    "u = u[u[\"price\"] < 40000]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(22355, 9)"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "u.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `odometer` column also includes several rows with very high values. For this model, you'll consider only rows with `odometer` < 100000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "u = u[u.odometer < 1e5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(9800, 9)"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "u.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To build a consistent model, you'll consider only cars with `year` > 2000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "u = u[u.year > 2000]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(9332, 9)"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "u.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After the filtering, you get a dataset with 9332 samples."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, you'll take a look at the data types used in the columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "price             int64\n",
       "year            float64\n",
       "condition        object\n",
       "cylinders        object\n",
       "fuel             object\n",
       "odometer        float64\n",
       "transmission     object\n",
       "size             object\n",
       "type             object\n",
       "dtype: object"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "u.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can notice, `year` and `odometer` are set as `float64` columns, which isn't ideal. The data type of these columns can be converted to `int` with the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "u.loc[:, \"year\"] = u.loc[:, \"year\"].astype(int)\n",
    "u.loc[:, \"odometer\"] = u.loc[:, \"odometer\"].astype(int)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, save the filtered dataset to a CSV file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "u.to_csv(\"vehicles_cleaned.csv\", index=False)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
